
    Daniel@FinTOC-2019 Shared Task : TOC Extraction and Title Detection

    We present different methods for the two tasks of the 2019 FinTOC challenge: Title Detection and Table of Contents Extraction. For the Title Detection task we present several approaches using various features: visual characteristics, punctuation density and character n-grams. Our best approach achieved an official F-measure of 94.88%, ranking sixth on this task. For the TOC Extraction task, we present a method combining visual characteristics of the document layout, with which we ranked first with a score of 42.72%.
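    As a rough illustration of the kind of feature-based title detection described above, the sketch below combines punctuation density with character n-gram features and trains an off-the-shelf classifier. It is not the authors' implementation; the toy data, helper names and model choice are illustrative assumptions.

    # Sketch of feature-based title detection: punctuation density + character
    # n-grams, as named in the abstract. Illustrative only, not the authors' system.
    import string
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def punctuation_density(line: str) -> float:
        # Fraction of characters in the line that are punctuation marks.
        return sum(c in string.punctuation for c in line) / max(len(line), 1)

    # Toy training data: text lines with a title / non-title label.
    lines = ["1. ANNUAL REPORT 2018", "The company recorded a profit of 3.2m.",
             "2.1 Risk factors", "This section describes the accounting policies."]
    labels = [1, 0, 1, 0]

    # Character n-grams capture casing and numbering patterns typical of titles.
    char_ngrams = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = hstack([char_ngrams.fit_transform(lines),
                csr_matrix([[punctuation_density(l)] for l in lines])])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)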

    Évaluation intrinsèque et extrinsèque du nettoyage de pages Web (Intrinsic and Extrinsic Evaluation of Web Page Cleaning)

    In this article, we tackle the problem of evaluating web page cleaning tools. This task is seldom studied in the literature, although it has consequences for the linguistic processing performed on web-based corpora. We propose two types of evaluation: (I) an intrinsic (content-based) evaluation with measures on words, tags and characters; (II) an extrinsic (task-based) evaluation on the same corpus, studying the effects of the cleaning step on the performance of an NLP pipeline. We show that the results of the two evaluations are not consistent, and that there are important differences between the studied languages. We conclude that a web page cleaning tool should be chosen with the target task in mind rather than on the basis of its intrinsic evaluation scores alone.
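    To make the intrinsic (content-based) side of the evaluation concrete, here is a minimal word-level sketch: the output of a cleaning tool is compared to a manually cleaned reference page using precision, recall and F-measure over tokens. This is only an assumption about how such measures can be computed; the paper's exact metrics on words, tags and characters may be defined differently.

    # Word-level intrinsic evaluation sketch: compare the text kept by a
    # cleaning tool against a manually cleaned reference version of the page.
    from collections import Counter

    def word_scores(system_text: str, gold_text: str):
        sys_counts = Counter(system_text.split())
        gold_counts = Counter(gold_text.split())
        overlap = sum((sys_counts & gold_counts).values())   # tokens in common
        precision = overlap / max(sum(sys_counts.values()), 1)
        recall = overlap / max(sum(gold_counts.values()), 1)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    print(word_scores("main article text plus a leftover menu item",
                      "main article text"))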

    Do we Name the Languages we Study? The #BenderRule in LREC and ACL articles

    This article studies the application of the #BenderRule in Natural Language Processing (NLP) articles along two dimensions: contrastively, by considering two major international conferences, LREC and ACL, and diachronically, by inspecting nearly 14,000 articles over a period ranging from 2000 to 2020 for LREC and from 1979 to 2020 for ACL. For this purpose, we created a corpus from LREC and ACL articles from the above-mentioned periods, from which we manually annotated nearly 1,000. We then developed two classifiers to automatically annotate the rest of the corpus. We show that LREC articles tend to respect the #BenderRule (80 to 90% of them do), whereas only around half of ACL articles do. Interestingly, over the considered periods the results appear to be stable for the two conferences, even though a rebound in ACL 2020 could be a sign of the influence of the blog post about the #BenderRule.
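    A trivially simple baseline for the underlying question, whether an article names the language(s) it studies, is to look for explicit language names in the text, as sketched below. This heuristic is only an illustration with a tiny hard-coded name list; it is not one of the two classifiers developed in the paper.

    # Naive #BenderRule baseline: does the article mention any language name?
    # Illustrative heuristic only; the paper's classifiers are more elaborate.
    LANGUAGE_NAMES = {"english", "french", "german", "chinese", "arabic",
                      "spanish", "japanese", "hindi", "swahili", "basque"}

    def names_a_language(article_text: str) -> bool:
        tokens = {t.strip(".,;:()").lower() for t in article_text.split()}
        return bool(tokens & LANGUAGE_NAMES)

    print(names_a_language("We train a parser on the Penn Treebank."))       # False
    print(names_a_language("We evaluate our parser on French and Basque."))  # True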

    Langues « par défaut » ? Analyse contrastive et diachronique des langues non citées dans les articles de TALN et d'ACL ("Default" Languages? A Contrastive and Diachronic Analysis of the Languages Not Named in TALN and ACL Articles)

    We study the application of the #BenderRule in natural language processing articles along a contrastive and a diachronic dimension, by examining the proceedings of two NLP conferences, TALN and ACL, over time. A sample of articles was annotated manually and two classifiers were developed to automatically annotate the remaining articles. This allows us to quantify the extent to which the #BenderRule is applied and to show a slight advantage in favour of TALN.

    Ambiguity Diagnosis for Terms in Digital Humanities

    Among the research devoted to terminology and word sense disambiguation, little attention has been paid to the ambiguity of term occurrences. Even when a lexical unit is indeed a term of the domain, it is not true, even in a specialised corpus, that all of its occurrences are terminological: some occurrences are terminological and others are not. A global decision at the corpus level about the terminological status of all occurrences of a lexical unit would therefore be erroneous. In this paper, we propose three original methods to characterise the ambiguity of term occurrences in the domain of social sciences for French. These methods model the context of term occurrences differently: the first relies on text mining, the second on textometry, and the third on text genre properties. The experimental results show the potential of the proposed approaches and open the way to a discussion of their hybridisation.

    SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German

    In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing in historical newspapers. The challenge proposed various tasks for three languages; among them, we focused on Named Entity Recognition in French and German texts. The best system we proposed ranked third for these two languages; it uses FastText embeddings and ELMo language models (FrELMo and German ELMo). We show that combining several word representations enhances the quality of the results for all NE types and that sentence segmentation has an important impact on the results.
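    The combination of word representations mentioned above can be pictured as a simple concatenation of per-token vectors before sequence tagging. The sketch below uses placeholder embedding functions (seeded random vectors) rather than the real FastText and ELMo APIs, so the dimensions and names are assumptions.

    # Sketch: combine several word representations by concatenating per-token
    # vectors. The embedding functions are placeholders, not real FastText/ELMo calls.
    import numpy as np

    def fasttext_vector(token: str) -> np.ndarray:
        # Placeholder for a pretrained, non-contextual FastText lookup (dim 300).
        return np.random.default_rng(abs(hash(token)) % 2**32).normal(size=300)

    def elmo_vector(token: str, sentence: list) -> np.ndarray:
        # Placeholder for a contextual ELMo representation of the token (dim 1024).
        seed = abs(hash(" ".join(sentence) + token)) % 2**32
        return np.random.default_rng(seed).normal(size=1024)

    sentence = ["Le", "journal", "paraît", "à", "Paris", "."]
    features = np.stack([np.concatenate([fasttext_vector(t), elmo_vector(t, sentence)])
                         for t in sentence])
    print(features.shape)  # (6, 1324): one combined vector per token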

    Highlighting a New Morphospecies within the Dialium Genus Using Leaves and Wood Traits

    During inventories of lesser-known timber species in eastern Gabon, a new Dialium morphospecies was discovered. To discriminate it from the other two Dialium species with 2–5 leaflets, 25 leaf traits were measured on 45 trees (16 Dialium pachyphyllum, 14 Dialium lopense, 15 Dialium sp. nov.). Nine wood chemical traits, as well as infrared spectra, were also examined on harvestable trees (four Dialium pachyphyllum and four Dialium sp. nov.). This study revealed seven discriminant leaf traits, which made it possible to create a field identification key. Nine significant differences in wood composition (five in sapwood and four in heartwood) were highlighted. Applying PLS-DA to FT-IR wood spectra made it possible to identify the new morphospecies accurately. These results provide strong support for describing a new species in this genus. Implications for the sustainable management of its populations are also discussed.
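    For readers unfamiliar with PLS-DA, the sketch below shows the usual recipe: a partial least squares regression onto a class indicator, with class membership read off the predicted score. The "spectra" here are synthetic placeholders; the actual measurements and preprocessing used in the study are different.

    # PLS-DA sketch on synthetic spectra: fit a PLS regression onto a 0/1
    # class indicator and threshold the prediction. Data are placeholders only.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    n_wavenumbers = 500
    X_pachyphyllum = rng.normal(0.0, 1.0, size=(20, n_wavenumbers))
    X_sp_nov = rng.normal(0.3, 1.0, size=(20, n_wavenumbers))   # slight systematic offset
    X = np.vstack([X_pachyphyllum, X_sp_nov])
    y = np.array([0] * 20 + [1] * 20)   # 0 = D. pachyphyllum, 1 = D. sp. nov.

    pls = PLSRegression(n_components=2).fit(X, y)
    y_pred = (pls.predict(X).ravel() > 0.5).astype(int)
    print("training accuracy:", (y_pred == y).mean())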

    Structure patterns in Information Extraction: a multilingual solution?

    IE systems nowadays work very well, but they are mostly monolingual and difficult to port to other languages. This suggests that we may have to stop thinking only in terms of traditional pattern-based approaches. Our project, PULS, performs epidemic surveillance through the analysis of online news in collaboration with MedISys, developed at the European Commission's Joint Research Centre (EC-JRC). PULS had only an English pattern-based system, and we carried out a pilot study on French to prepare a multilingual extension. We present here why we chose to set aside classical approaches in favour of a mainly language-independent method based only on the discourse properties of press article structure. Our results show a precision of 87% and a recall of 93%, and we have good reasons to think that this approach will also be effective for other languages.

    Veille épidémiologique multilingue : une approche parcimonieuse au grain caractère fondée sur le genre textuel (Multilingual Epidemic Surveillance: A Parsimonious Character-Level Approach Based on Text Genre)

    In this dissertation we tackle the problem of multilingual epidemic surveillance. The approach advocated here is differential, endogenous and non-compositional. We maximise factorisation by exploiting genre properties and communication principles, so that new languages can be handled at minimal marginal cost. Our local analysis does not rely on classical linguistic analysers for morphology, syntax or semantics; instead, the distribution of character strings at key positions of the text is exploited, thus avoiding the problem of defining a "word". We implemented DAnIEL (Data Analysis for Information Extraction in any Language), a system based on this approach. DAnIEL analyses press articles to detect epidemic events and groups them into disease-location pairs. It is fast in comparison to state-of-the-art systems and needs very little additional knowledge to process new languages. DAnIEL is also evaluated on the analysis of scientific articles for classification and keyword extraction. Finally, we propose to use DAnIEL outputs to perform a task-based evaluation of boilerplate removal systems.
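    The character-level, genre-based idea can be caricatured as follows: look for lexicon substrings that recur between the salient opening zone of a press article and the rest of the text, without any tokenisation, and pair them with locations. The toy below is a deliberately simplified assumption, with tiny hard-coded lexicons; it is not the actual DAnIEL algorithm or its resources.

    # Toy caricature of the DAnIEL idea: detect a disease name repeated between
    # the opening of an article and its body, then pair it with a location.
    DISEASES = ["grippe", "cholera", "dengue"]
    LOCATIONS = ["Haïti", "Paris", "Dakar"]

    def detect_events(article: str, head_ratio: float = 0.2):
        cut = int(len(article) * head_ratio)
        head, body = article[:cut], article[cut:]
        diseases = [d for d in DISEASES if d in head and d in body]   # repetition test
        locations = [l for l in LOCATIONS if l in article]
        return [(d, l) for d in diseases for l in locations]

    article = ("Épidémie de cholera à Haïti : les autorités s'inquiètent. "
               "Le nombre de cas de cholera continue d'augmenter dans la capitale.")
    print(detect_events(article))   # [('cholera', 'Haïti')]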